10 research outputs found
NetShaper: A Differentially Private Network Side-Channel Mitigation System
The widespread adoption of encryption in network protocols has significantly
improved the overall security of many Internet applications. However, these
protocols cannot prevent network side-channel leaks -- leaks of sensitive
information through the sizes and timing of network packets. We present
NetShaper, a system that mitigates such leaks based on the principle of traffic
shaping. NetShaper's traffic shaping provides differential privacy guarantees
while adapting to the prevailing workload and congestion conditions, and allows
configuring a tradeoff between privacy guarantees, bandwidth and latency
overheads. Furthermore, NetShaper provides a modular and portable tunnel
endpoint design that can support diverse applications. We present a
middlebox-based implementation of NetShaper and demonstrate its applicability
in a video streaming and a web service application.
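The core traffic-shaping idea can be sketched as follows. This is an illustrative simplification, not NetShaper's actual mechanism: each shaping interval, the tunnel transmits a number of bytes equal to the queued byte count plus Laplace noise calibrated to a chosen sensitivity and epsilon, clamped to a valid range, with padding or deferral making up the difference. The function name and parameters are hypothetical.

```python
import random

def shaped_payload_size(queue_bytes, sensitivity, epsilon, max_burst):
    """One shaping decision (hedged sketch): return a DP-noised
    transmission size so the observed traffic reveals little about
    the true application queue length."""
    scale = sensitivity / epsilon
    # The difference of two Exp(1) draws is a standard Laplace sample.
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    # Clamp to a valid size: never negative, never above the burst cap.
    return max(0, min(max_burst, round(queue_bytes + noise)))
```

A smaller epsilon widens the noise and raises the expected padding, which is one concrete face of the privacy/bandwidth tradeoff the abstract mentions.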
Packing Privacy Budget Efficiently
Machine learning (ML) models can leak information about users, and
differential privacy (DP) provides a rigorous way to bound that leakage under a
given budget. This DP budget can be regarded as a new type of compute resource
in workloads of multiple ML models training on user data. Once it is used, the
DP budget is forever consumed. Therefore, it is crucial to allocate it most
efficiently to train as many models as possible. This paper presents a
DP-budget scheduler that optimizes for efficiency. We formulate privacy
scheduling as a new type of multidimensional knapsack problem, called privacy
knapsack, which maximizes DP budget efficiency. We show that privacy knapsack
is NP-hard, hence practical algorithms are necessarily approximate. We develop
an approximation algorithm for privacy knapsack, DPK, and evaluate it on
microbenchmarks and on a new, synthetic private-ML workload we developed from
the Alibaba ML cluster trace. We show that DPK: (1) often approaches the
efficiency-optimal schedule, (2) consistently schedules more tasks compared to
a state-of-the-art privacy scheduling algorithm that focuses on fairness
(1.3-1.7x in Alibaba, 1.0-2.6x in microbenchmarks), but (3) sacrifices some
level of fairness for efficiency. Therefore, using DPK, DP ML operators should
be able to train more models on the same amount of user data while offering the
same privacy guarantee to their users.
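The privacy-knapsack setting can be made concrete with a toy greedy heuristic. This is not the DPK algorithm from the paper, only an illustrative approximation: each task demands some epsilon from each data block it touches, and we admit tasks in decreasing profit-per-budget order while every block still has budget to spare.

```python
def greedy_privacy_knapsack(tasks, capacity):
    """Illustrative greedy approximation for privacy knapsack
    (hypothetical helper, not the paper's DPK algorithm).

    tasks: list of (profit, {block_id: epsilon_demand}) tuples
    capacity: {block_id: available_epsilon}
    """
    remaining = dict(capacity)
    # Rank tasks by profit per unit of total epsilon demanded.
    order = sorted(tasks, key=lambda t: t[0] / sum(t[1].values()), reverse=True)
    scheduled, total_profit = [], 0.0
    for profit, demand in order:
        if all(remaining.get(b, 0.0) >= eps for b, eps in demand.items()):
            for b, eps in demand.items():
                remaining[b] -= eps  # budget, once spent, is gone forever
            scheduled.append((profit, demand))
            total_profit += profit
    return scheduled, total_profit
```

Because the budget is multidimensional (one dimension per data block) and non-replenishable, such greedy rules can be far from optimal, which is consistent with the problem being NP-hard.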
Web Transparency for Complex Targeting: Algorithms, Limits, and Tradeoffs
Big Data promises important societal progress but exacerbates the need for due process and accountability. Companies and institutions can now discriminate between users at an individual level using collected data or past behavior. Worse, today they can do so in near-perfect opacity. The nascent field of web transparency aims to develop the tools and methods necessary to reveal how information is used; however, it still lacks robust tools that let users and investigators identify targeting based on multiple inputs. Here, we formalize for the first time the problem of detecting and identifying targeting on combinations of inputs, and we provide the first algorithm that is asymptotically exact. This algorithm is designed to serve as a theoretical foundational block for building future scalable and robust web transparency tools. It offers three key properties. First, our algorithm is service-agnostic and applies to a variety of settings under a broad set of assumptions. Second, our algorithm's analysis delineates a theoretical detection limit that characterizes which forms of targeting can be distinguished from noise and which cannot. Third, our algorithm establishes fundamental tradeoffs that lead the way to new metrics for the science of web transparency. Understanding the tradeoff between effective targeting and targeting concealment lets us determine under which conditions predatory targeting can be made unprofitable by transparency tools.
Vers une plus grande transparence du Web (Toward Greater Transparency of the Web)
More and more, the Web giants (Amazon, Google, and Twitter foremost among them) tap into the riches of "Big Data": they collect myriad data that they exploit for their personalized recommendation algorithms and their advertising campaigns. Such methods can considerably improve the services rendered to their users, but their opacity is a matter of debate. Indeed, no sufficiently robust tool exists today that can trace, across the Web, how online services use a user's data and information. Motivated by this lack of transparency, we developed a prototype named XRay, which can predict which piece of data, among all those present in a user account, is responsible for the receipt of an advertisement. In this article, we present its principle as well as the results of our first experiments. At the same time, we introduce the very first theoretical model of the web transparency problem, and we interpret XRay's performance in light of the results obtained in this model. In particular, we show that Θ(log N) auxiliary user accounts, populated by a random process, suffice to determine which of the N pieces of data present caused the receipt of an advertisement. We briefly discuss possible extensions and some open problems.
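The Θ(log N) auxiliary-account idea can be illustrated with a toy correlation-based inference, in the spirit of XRay but not its actual implementation; the function names and the `ad_shown` oracle are hypothetical. Each auxiliary account is populated with a random half of the N inputs, and the input whose presence agrees most often with the ad's appearance is blamed.

```python
import random

def infer_targeted_input(inputs, num_accounts, ad_shown):
    """XRay-style inference sketch (hypothetical API): blame the input
    most correlated with the ad's appearance across random accounts.

    ad_shown(account_inputs) -> bool simulates observing the ad
    in an account populated with that subset of inputs.
    """
    rng = random.Random(0)  # fixed seed for reproducibility
    accounts = [set(x for x in inputs if rng.random() < 0.5)
                for _ in range(num_accounts)]
    scores = {x: 0 for x in inputs}
    for acct in accounts:
        saw_ad = ad_shown(acct)
        for x in inputs:
            # Reward inputs whose presence agrees with the observation.
            if (x in acct) == saw_ad:
                scores[x] += 1
    return max(scores, key=scores.get)
```

The true cause agrees with every observation, while each decoy agrees only about half the time, so a logarithmic number of accounts suffices to separate it from the noise with high probability.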
Boost: Effective Caching in Differentially-Private Databases
Differentially private (DP) databases can enable privacy-preserving analytics
over datasets or data streams containing sensitive personal records. In such
systems, user privacy is a very limited resource that is consumed by every new
query, and hence must be aggressively conserved. We propose Boost, the most
effective caching component for linear query workloads over DP databases. Boost
builds upon private multiplicative weights (PMW), a DP mechanism that is
powerful in theory but very ineffective in practice, and transforms it into a
highly effective caching object, PMW-Bypass, which uses prior-query results
obtained through an external DP mechanism to train a PMW to answer arbitrary
future linear queries accurately and "for free" from a privacy perspective. We
show that Boost with PMW-Bypass conserves significantly more budget compared to
vanilla PMW and simpler cache designs: a 1.51-14.25x improvement in
experiments on public Covid19 and CitiBike datasets. Moreover, Boost
incorporates support for range-query workloads, such as timeseries or streaming
workloads, where opportunities exist to further conserve privacy budget through
DP parallel composition and warm-starting of PMW state. Our work thus
establishes both a coherent system design and the theoretical underpinnings for
effective caching in DP databases.
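A single private-multiplicative-weights step, the primitive Boost builds on, can be sketched as follows. This is a simplified textbook-style PMW update, not Boost's PMW-Bypass logic; the function name and parameters are illustrative. After an answer has been paid for through an external DP mechanism, the public histogram is reweighted toward agreeing with that answer, so future similar queries can be served from the histogram at no additional privacy cost.

```python
import math

def pmw_update(hist, query, true_answer, learning_rate=0.1):
    """One multiplicative-weights step on a public histogram
    (hedged sketch of the PMW primitive, not Boost's PMW-Bypass).

    hist:  dict mapping domain element -> probability mass
    query: dict mapping domain element -> coefficient in [0, 1]
           (a linear query evaluated as sum_x hist[x] * query[x])
    """
    est = sum(hist[x] * query.get(x, 0.0) for x in hist)
    # Push mass toward the elements the query covers if we
    # underestimated, away from them if we overestimated.
    sign = 1.0 if true_answer > est else -1.0
    new = {x: hist[x] * math.exp(sign * learning_rate * query.get(x, 0.0))
           for x in hist}
    z = sum(new.values())
    return {x: v / z for x, v in new.items()}
```

Once the histogram's estimates are accurate for a query class, answers read off it consume no further budget, which is the sense in which a trained PMW answers "for free".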